A method of synthesizing an unvoiced speech signal



The present invention relates to the field of synthesizing speech or music,
and more particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize
speech from a generic text in a given language. Nowadays, TTS systems have been put into
practical operation for many applications, such as access to databases through the telephone
network or aid to handicapped people. One method to synthesize speech is by concatenating
elements of a recorded set of subunits of speech such as demisyllables or polyphones. The
majority of successful commercial systems employ the concatenation of polyphones. The
polyphones comprise groups of two (diphones), three (triphones) or more phones and may be
determined from nonsense words, by segmenting the desired grouping of phones at stable
spectral regions. In a concatenation based synthesis, the conservation of the transition
between two adjacent phones is crucial to assure the quality of the synthesized speech. With
the choice of polyphones as the basic subunits, the transition between two adjacent phones is
preserved in the recorded subunits, and the concatenation is carried out between similar
phones.

Before the synthesis, however, the phones must have their duration and pitch
modified in order to fulfil the prosodic constraints of the new words containing those phones.
This processing is necessary to avoid the production of monotonous sounding synthesized
speech. In a TTS system, this function is performed by a prosodic module. To allow the
duration and pitch modifications in the recorded subunits, many concatenation based TTS
systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) model of synthesis
(E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for
text-to-speech synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990).

In the TD-PSOLA model, the speech signal is first submitted to a pitch
marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced
segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by
a superposition of Hanning windowed segments centered at the pitch marks and extending
from the previous pitch mark to the next one. The duration modification is provided by
deleting or replicating some of the windowed segments. The pitch period modification, on
the other hand, is provided by increasing or decreasing the superposition between windowed
segments.

Despite the success achieved in many commercial TTS systems, the synthetic 
speech produced by using the TD-PSOLA model of synthesis can present some drawbacks, 
mainly under large prosodic variations. 



EP-0363233, US-A-5,479,564 and EP-0706170 disclose PSOLA methods. A
specific example is also the MBR-PSOLA method as published by T. Dutoit and H. Leich in
Speech Communication, Elsevier Publisher, November 1993, vol. 13, No. 3-4.
The method described in document U.S. Pat. No. 5,479,564 suggests a means of modifying
the frequency of an audio signal by overlap-adding short-term signals extracted from that
signal. The length of the weighting windows used to obtain the short-term signals is
approximately equal to two times the period of the audio signal and their position within the
period can be set to any value (provided the time shift between successive windows is equal
to the period of the audio signal). Document U.S. Pat. No. 5,479,564 also describes a means
of interpolating waveforms between segments to concatenate, so as to smooth out
discontinuities.

When a noisy signal is to be synthesized by means of a known PSOLA method, segments of the
signal are repeated periodically. This way an unintended periodicity is introduced into the
frequency spectrum. This is perceived as a metallic sound. This problem occurs for all noisy
signals which do not have a fundamental frequency, such as unvoiced speech parts or music.
An unvoiced speech part, like the 's' sound, has no pitch. The vocal cords are not moving as
they do for a voiced sound. Instead, a noisy hiss-sound is produced by pushing air through a
small opening between the vocal cords. Whisper is an example of speech containing only
unvoiced parts. Where there is no pitch, there is no need to change it. However, it can be
desirable to change the duration of an unvoiced speech part.
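
The unintended periodicity can be reproduced numerically. The following Python sketch is an illustration only and is not taken from any of the cited PSOLA documents: it assumes NumPy, white Gaussian noise as a stand-in for an unvoiced segment, a 16 kHz sampling rate and marks 10 ms apart, and the helper names `stretch_by_repetition` and `correlation_at_lag` are hypothetical. It doubles the duration by replicating each Hanning windowed segment and then measures how strongly the result resembles itself one period later.

```python
import numpy as np


def stretch_by_repetition(signal, period, repeat=2):
    """Lengthen `signal` by placing each Hanning windowed segment `repeat` times."""
    window = np.hanning(2 * period)
    # Marks every `period` samples, as for unvoiced segments in TD-PSOLA.
    marks = range(period, len(signal) - period, period)
    out = np.zeros((len(marks) * repeat + 1) * period)
    pos = 0
    for mark in marks:
        segment = signal[mark - period:mark + period] * window
        for _ in range(repeat):
            out[pos:pos + 2 * period] += segment  # overlap-add
            pos += period
    return out


def correlation_at_lag(x, lag):
    """Normalized correlation between x and a copy of itself delayed by `lag`."""
    a, b = x[:-lag], x[lag:]
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


fs = 16000                      # assumed sampling rate in Hz
period = fs // 100              # 10 ms mark spacing
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)  # one second of noise, stand-in for unvoiced speech

stretched = stretch_by_repetition(noise, period)
corr_original = correlation_at_lag(noise, period)
corr_stretched = correlation_at_lag(stretched, period)
print(corr_original, corr_stretched)
```

For the noise itself the correlation at a lag of one period stays close to zero, whereas the stretched signal is clearly correlated with itself at that lag, which corresponds to the periodicity that is perceived as a metallic sound.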



The present invention therefore aims to provide a method of synthesizing a
signal which makes it possible to modify the duration of unvoiced speech parts or music
without introducing an unintended periodicity in the signal.

The present invention provides for a method of synthesizing a signal, in
particular a noisy signal, based on an original signal. Further the present invention provides
for a computer program product for performing such a synthesis, as well as for a
corresponding computer system, in particular a text-to-speech system.

In accordance with the invention the required pitch bell locations of the signal
to be synthesized are determined. This is done based on an assumed frequency
of, for example, 100 Hz. This chosen frequency corresponds to a pitch period. The required
pitch bell locations of the signal to be synthesized are spaced apart on the time axis by
intervals having the length of the pitch period. The required pitch bell locations are mapped
onto the original signal to provide pitch bell locations in the domain of the original signal.
The pitch bell locations in the domain of the original signal are randomly shifted. Preferably
the randomization is performed by shifting the pitch bell locations in the original signal
domain within +/- the pitch period.

In accordance with an embodiment of the invention the windowing is
performed by means of a sine-window. The advantage of a sine-window is that it helps to
reduce any residual periodicity. In particular, using a sine-window is advantageous in that it
ensures that the signal envelope in the power domain remains constant. Unlike with a periodic
signal, when two noise samples are added, the magnitude of the sum can be smaller than the
absolute value of either of the two samples. This is because the signals are (mostly) not in
phase. The sine-window adjusts for this effect and removes the envelope modulation.
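
The constant power envelope can be checked with a short calculation. The sketch below is a hedged illustration that assumes NumPy, a window length m of two pitch periods and neighbouring pitch bells that overlap by half a window; neither assumption is stated in the text above. For uncorrelated noise the powers of overlapping bells add, so the quantity of interest is the sum of the squared window values in the overlap region.

```python
import numpy as np


def sine_window(m):
    """Sine-window w[n] = sin(pi * (n + 0.5) / m) for 0 <= n < m."""
    n = np.arange(m)
    return np.sin(np.pi * (n + 0.5) / m)


m = 320            # assumed window length: two periods of 160 samples
half = m // 2      # assumed spacing between neighbouring pitch bells
w = sine_window(m)

# Falling half of one bell plus rising half of the next, in the power domain.
power_sum_sine = w[half:] ** 2 + w[:half] ** 2

# A Hanning-type window (here the squared sine) sums to one in amplitude,
# but its squared sum dips towards 0.5 in the middle of the overlap.
hann_like = w ** 2
power_sum_hann = hann_like[half:] ** 2 + hann_like[:half] ** 2

print(power_sum_sine.min(), power_sum_sine.max())
print(power_sum_hann.min(), power_sum_hann.max())
```

Under these assumptions the squared sine-windows sum to one everywhere in the overlap, because the shifted sine is a cosine, whereas the Hanning-type window leaves a dip that would show up as a modulation of the noise envelope.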


In the following, preferred embodiments of the invention are described in
greater detail by making reference to the drawings in which:

Fig. 1 is illustrative of a flow chart of an embodiment of the present invention,
Fig. 2 is illustrative of an example for synthesizing an unvoiced speech signal,
Fig. 3 is a block diagram of a preferred embodiment of a computer system.



The flow chart of Fig. 1 is illustrative of an embodiment of the method of
synthesizing a signal. In step 100 an original signal having a duration of y is provided. For
example, the original signal is a natural speech signal containing unvoiced speech or a music
signal having a noisy signal characteristic. Further a choice for a fundamental frequency f is
made even though the original signal does not have such a fundamental frequency because of
its noisy characteristics. The choice of a frequency f corresponds to a choice of a pitch period
p = 1/f. A convenient choice for a frequency f is between 50 Hz and 200 Hz, preferably 100 Hz.
In addition the desired duration x of the signal to be synthesized is inputted in step 100. In
step 102 the pitch bell locations in the domain of the signal to be synthesized are determined
in accordance with the choice of frequency f and pitch period p. This is done by dividing the
time axis in the domain of the signal to be synthesized into intervals of length p. In step 104
the pitch bell locations are mapped from the domain of the signal to be synthesized onto the
domain of the original signal. When the duration x is longer than the duration y of the
original signal this means that the pitch bell locations i in the domain of the original signal
are spaced apart by intervals which are shorter than the pitch period p. In the opposite case
the intervals between the pitch bell locations i in the domain of the original signal will be
longer than the intervals between the pitch bell locations in the domain of the signal to be
synthesized. In step 106 the pitch bell locations i in the domain of the original signal are
randomized. This can be done by randomly shifting each of the pitch bell locations i within an
interval of +/- p around the original pitch bell location i. A pseudo random number generator
can be utilized to perform this randomization. In step 108 the windowing is performed in the
domain of the original signal. Preferably this is done by means of a sine-window which is
applied on the randomized pitch bell locations i'; this way periodicity is further reduced. In
step 110 the resulting pitch bells are overlapped and added in the domain of the signal to be
synthesized which provides the synthesized signal.
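
Steps 100 to 110 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the embodiment: it assumes NumPy, a mono signal held in an array, a sine-window spanning two pitch periods, zero padding at the signal edges and rounding of all locations to whole samples. The function name `synthesize_unvoiced` and its parameters are hypothetical.

```python
import numpy as np


def sine_window(m):
    """Sine-window w[n] = sin(pi * (n + 0.5) / m) for 0 <= n < m."""
    n = np.arange(m)
    return np.sin(np.pi * (n + 0.5) / m)


def synthesize_unvoiced(original, fs, x, f=100.0, rng=None):
    """Return a signal of duration x seconds built from a noisy `original`."""
    rng = np.random.default_rng() if rng is None else rng
    original = np.asarray(original, dtype=float)
    p = int(round(fs / f))             # step 100: pitch period p in samples
    y = len(original)                  # original duration y in samples
    out_len = int(round(x * fs))       # desired duration x in samples
    window = sine_window(2 * p)        # assumption: a bell spans two periods
    # Zero padding (assumption) so that shifted windows never leave the array.
    padded = np.pad(original, 2 * p, mode="constant")
    out = np.zeros(out_len + 2 * p)
    for t in range(0, out_len + 1, p):                    # step 102
        i = int(round(t * y / out_len))                   # step 104
        shift = int(round(rng.uniform(-1.0, 1.0) * p))    # step 106
        i_rand = i + shift
        # Window centred on i_rand; the offset of 2p accounts for the padding.
        bell = padded[i_rand + p:i_rand + 3 * p] * window  # step 108
        out[t:t + 2 * p] += bell                           # step 110
    return out[p:p + out_len]          # drop the half bell before time zero


fs = 16000                                    # assumed sampling rate in Hz
rng = np.random.default_rng(1)
noise = rng.standard_normal(fs // 2)          # y = 0.5 s of noise
stretched = synthesize_unvoiced(noise, fs, x=1.0, f=100.0, rng=rng)
print(len(noise), len(stretched))
```

Each pitch bell is cut out of the original signal at its randomized location but added at its regular location in the output, so the output has the requested length while successive bells no longer repeat the same samples at a fixed period.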

Fig. 2 illustrates this signal synthesis by way of example. Time axis 200 is in
the domain of the signal to be synthesized. The required duration x of the signal to be
synthesized is one second in the example considered here. The assumed frequency f is 100
Hz, which corresponds to a pitch period p of 10 milliseconds. This means that the required
pitch bell locations in the domain of the signal to be synthesized on time axis 200 are spaced
apart by intervals of p = 10 milliseconds, i.e. the first pitch bell location is located at zero
seconds on time axis 200, the next pitch bell location is at 10 milliseconds, the following at
20 milliseconds and so on. In other words the pitch bell locations in the domain of the signal
to be synthesized are determined by points on the time axis 200 which are spaced apart by
intervals of p starting at time zero. The pitch bell locations on time axis 200 are mapped onto
time axis 202 in the domain of the original signal. The original signal has a duration of y =
0.5 seconds. As the duration y is smaller than the duration x of the signal to be synthesized
this means that the pitch bell locations need to be "compressed" on time axis 202. As the
duration y is half the duration x the mapped pitch bell locations on the time
axis 202 are spaced apart by p/2 instead of p. This means that the first pitch bell location i = 1
is at zero milliseconds on the time axis 202; the following pitch bell location i = 2 is at 5
milliseconds, the next pitch bell location i = 3 is at 10 milliseconds and so on. In other words
the first pitch bell location at time zero milliseconds on the time axis 200 is mapped onto the
pitch bell location i = 1 on the time axis 202 at zero milliseconds; the required pitch bell
location at 10 milliseconds on the time axis 200 is mapped onto the pitch bell location i = 2 at
5 milliseconds on the time axis 202; the required pitch bell location at 20 milliseconds on the
time axis 200 is mapped onto the pitch bell location i = 3 at time 10 milliseconds on the time
axis 202 and so on.

Next the pitch bell locations i are randomized. This is illustrated in figure
2 with respect to the first pitch bell location i = 1 on the time axis 202. An interval of +/- p
around zero milliseconds is defined on the time axis 202. Within this interval the pitch bell
location i = 1 is randomly shifted. For the pitch bell location i = 1 the interval is between -10
milliseconds and +10 milliseconds on the time axis 202. In the example considered here this
results in a randomized pitch bell location i' at 7.5 milliseconds on the time axis 202. At this
position the original signal is windowed by means of a window function 204. Preferably the
following window is used to provide the window function 204:

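\[ w[n] = \sin\left( \frac{\pi \, (n + 0.5)}{m} \right), \qquad 0 \le n < m \]

where m is the length of the window and n is the running index.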

Preferably the randomization of the pitch bell locations i is performed in
accordance with the following formula:

\[ i' = i + (R \times p) \]

where i denotes the original pitch bell location on the time axis 202, i' is the
new pitch bell location after the randomization, R is a random number between -1 and 1 and
p is the pitch period. The result of the windowing of the original signal is a pitch bell. This
pitch bell is placed at the first required pitch bell location within the domain of the signal to
be synthesized on time axis 200 as illustrated in figure 2. This process is repeated with
respect to all required pitch bells on the time axis. These pitch bells are added, which yields
the desired synthesized signal of length x.
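
The numbers of this example can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `map_location` and `randomize` are hypothetical, and the value R = 0.75 is back-computed from the randomized location of 7.5 milliseconds given in the example rather than stated in the text.

```python
def map_location(t_ms, x_ms, y_ms):
    """Map a required location on time axis 200 onto time axis 202."""
    return t_ms * y_ms / x_ms


def randomize(i_ms, r, p_ms):
    """Apply i' = i + (R x p), with R between -1 and 1."""
    return i_ms + r * p_ms


# x = 1 s, y = 0.5 s and p = 10 ms for f = 100 Hz, all in milliseconds.
x_ms, y_ms, p_ms = 1000, 500, 10

required = [k * p_ms for k in range(3)]                    # axis 200: 0, 10, 20 ms
mapped = [map_location(t, x_ms, y_ms) for t in required]   # axis 202: 0, 5, 10 ms
shifted = randomize(mapped[0], 0.75, p_ms)                 # R = 0.75 is inferred

print(required, mapped, shifted)
```

Because y is half of x, every required location is scaled by one half, and the first location can end up anywhere between -10 and +10 milliseconds after the random shift.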

Fig. 3 is illustrative of a block diagram of a computer system, such as a
text-to-speech system. The computer system 300 has a module 302 for storing an original
signal having a duration of y. Further the computer system 300 has a module 304 for storing a
pre-selected frequency f, which determines the pitch period p. Module 306 serves to determine
required pitch bell locations of the signal to be synthesized based on the required duration x
of the signal to be synthesized and the pre-selected frequency. Module 308 serves to map the
required pitch bell locations in the domain of the signal to be synthesized onto the domain of
the original signal. This way the pitch bell locations i are determined as illustrated in the
example of Fig. 2. Module 310 serves to randomize the pitch bell locations i. Module 310 is
coupled to module 312 which provides random numbers for the randomization process.
Module 314 serves to perform the windowing of the original signal on the randomized pitch
bell locations i'. The resulting pitch bells are then overlapped and added in the domain of the
signal to be synthesized by means of module 316. This results in the synthesized signal of the
desired duration x.

PHNL020859EPP, 17.09.2002
CLAIMS:



1. A method of synthesizing a signal comprising the steps of:
a) determining of a required pitch bell location,
b) mapping of the required pitch bell location onto an original signal to provide a
first pitch bell location,
c) randomly shifting the first pitch bell location to provide a second pitch bell
location,
d) windowing of the original signal on the second pitch bell location to provide a
pitch bell,
e) repeating of the steps a) to d) for all required pitch bell locations and
performing an overlap and add operation with respect to the pitch bells in order to synthesize
the signal.



2. The method of claim 1, the determination of required pitch bell locations being
performed by dividing the required length of the signal to be synthesized into time intervals,
each of the time intervals having the length of a pitch.

3. The method of claims 1 or 2, whereby the step of randomizing of the first
pitch bell location is performed by randomly shifting the first pitch bell location within an
interval of +/- the pitch.

4. The method of any one of the preceding claims 1, 2 or 3, whereby the step of
randomizing the first pitch bell location i to provide the second pitch bell location i' is
performed in accordance with the following equation:

\[ i' = i + (R \times p), \]

where R is a random number between -1 and +1 and p is the pitch.



5. The method of any one of the preceding claims 1 through 4, whereby the
windowing is performed by means of a sine-window.
6. The method of any one of the preceding claims 1 to 5, whereby the
windowing is performed by means of the following sine-window function:

\[ w[n] = \sin\left( \frac{\pi \, (n + 0.5)}{m} \right), \qquad 0 \le n < m \]

where m is the length of the window and n is the running index.

7. The method of any one of the preceding claims 1 to 6, whereby the original signal
does not have a fundamental frequency, the original signal preferably comprising
unvoiced speech or music.

8. A computer program product, in particular a digital storage medium, comprising
program means for performing the steps of:
a) determining of a required pitch bell location,
b) mapping of the required pitch bell location onto an original signal to provide a
first pitch bell location,
c) randomizing the first pitch bell location to provide a second pitch bell location,
d) windowing of the original signal on the second pitch bell location to provide a
pitch bell,
e) repeating of the steps a) to d) for all required pitch bell locations and
performing an overlap and add operation with respect to the pitch bells in order to synthesize
the signal.



9. A computer system, in particular a text-to-speech synthesis system, for
synthesizing a signal, the computer system comprising:
- means for determining of required pitch bell locations within the signal to be
synthesized,
- means for mapping of the required pitch bell locations onto an original signal
to provide first pitch bell locations (i),
- means for randomizing the first pitch bell locations to provide second pitch
bell locations (i'),
- means for windowing of the original signal on the second pitch bell locations
to provide pitch bells,
- means for performing an overlap and add operation with respect to the pitch
bells in order to synthesize the signal.

10. A synthesized signal comprising a number of pitch bells which are overlapped
and added, each of the pitch bells resulting from windowing of an original signal on a second
pitch bell location (i'), the second pitch bell location having been obtained by randomizing of
a first pitch bell location (i), which is obtained by mapping of a required pitch bell location
onto the original signal.
ABSTRACT:

The present invention relates to a method of synthesizing a signal comprising
the steps of:
a) determining of a required pitch bell location,
b) mapping of the required pitch bell location onto an original signal to provide a
first pitch bell location,
c) randomizing the first pitch bell location to provide a second pitch bell location,
d) windowing of the original signal on the second pitch bell location to provide a
pitch bell,
e) repeating of the steps a) to d) for all required pitch bell locations and
performing an overlap and add operation with respect to the pitch bells in order to synthesize
the signal.

Fig. 1